District-Level Forecast of Achieving Trachoma Elimination as a Public Health Problem By 2030: An Ensemble Modelling Approach

Abstract Assessing the feasibility of 2030 as a target date for global elimination of trachoma, and identification of districts that may require enhanced treatment to meet World Health Organization (WHO) elimination criteria by this date are key challenges in operational planning for trachoma programmes. Here we address these challenges by prospectively evaluating forecasting models of trachomatous inflammation–follicular (TF) prevalence, leveraging ensemble-based approaches. Seven candidate probabilistic models were developed to forecast district-wise TF prevalence in 11 760 districts, trained using district-level data on the population prevalence of TF in children aged 1–9 years from 2004 to 2022. Geographical location, history of mass drug administration treatment, and previously measured prevalence data were included in these models as key predictors. The best-performing models were included in an ensemble, using weights derived from their relative likelihood scores. To incorporate the inherent stochasticity of disease transmission and challenges of population-level surveillance, we forecasted probability distributions for the TF prevalence in each geographic district, rather than predicting a single value. Based on our probabilistic forecasts, 1.46% (95% confidence interval [CI]: 1.43–1.48%) of all districts in trachoma-endemic countries, equivalent to 172 districts, will exceed the 5% TF control threshold in 2030 with the current interventions. Global elimination of trachoma as a public health problem by 2030 may require enhanced intervention and/or surveillance of high-risk districts.

Trachoma, an ocular disease caused by Chlamydia trachomatis, is the leading cause of infection blindness in the world [1].The WHO recommended the SAFE strategy for trachoma control, which includes surgery to treat trachomatous trichiasis, antibiotics to clear active infection, facial cleanliness, and environmental improvement [2,3].Although this strategy has empirically been shown to be effective in reducing disease burden [4], interventions may need to be intensified in areas with persistent infection [5].
The WHO Alliance for the Global Elimination of Trachoma by the year 2020 (GET2020) was launched in 1996 [6].The alliance supports and collaborates with the WHO to support Ministries of Health and Non-Governmental Development Organizations' programmatic efforts toward eliminating trachoma as a public health problem by 2020 [7].Elimination of trachoma as a public health problem (EPHP) requires a prevalence of trachomatous inflammation-follicular in children aged 1-9 years (TF 1-9 ) of <5%, sustained for at least two years in the absence of ongoing antibiotic mass treatment, in each formerly endemic district [8,9].However, by 2020 only 15% of endemic countries had met the EPHP target, and thus global EPHP was not attained.The road map for neglected tropical diseases 2021-2030, endorsed by the World Health Assembly in 2020, set 2030 as the new target date for global EPHP [10].With 22% of endemic countries having achieved the target as of 2022 [3], it is still unclear if global elimination by 2030 is attainable.Forecasting the prevalence of trachoma in endemic countries using statistically robust methods represents a useful tool for ascertaining the feasibility of the 2030 deadline and identifying areas that might benefit from enhanced control activities.
Both mechanistic and statistical models have been used to forecast the prevalence of infectious diseases, including trachoma.Most commonly, mechanistic models such as the susceptible, infected, susceptible (SIS) model structure have been used in trachoma modeling [11][12][13][14][15][16].However, forecasts generated by mechanistic models do not necessarily perform better than forecasts generated by purely statistical models [17].There has been a recent shift towards probabilistic forecasts that provide both a predicted outcome and the uncertainty of the prediction [18].Additionally, several studies have demonstrated that ensemble models used to forecast disease prevalence may perform better than their constituent models [19][20][21].
In the present study, we use available survey data to build an ensemble of probabilistic models to forecast the prevalence of TF 1-9 in 2030.In addition to evaluating whether 2030 is a reasonable goal for global EPHP, the ensemble can be used to forecast TF 1-9 in regions without prior surveys and identify regions where enhanced intervention may be required to meet EPHP by 2030.

Data
Implementation unit (IU) level TF 1-9 and trachomatous trichiasis (TT) prevalence surveys and mass drug administration (MDA) distribution data were accessed through the GET2020 database for each IU.The GET2020 database is maintained by the International Trachoma Initiative (ITI) in partnership with WHO.An IU is defined as an administrative unit at which trachoma implementation activities, such as surveys and mass drug administration take place, and typically contains 100 000-250 000 people [9].Additionally, IU-level geographic information was provided by ITI.
The available data were divided longitudinally into 3 parts: a training data set (2004-2018), a scoring data set (2019-2021), and an evaluation data set (2022-2023).All the candidate models were trained on the training data set.The scoring and subsequent assignment of weights for our ensemble models was based on the scoring data set.The performance of the generated ensembles and the candidate models was assessed based on the evaluation data set (Figure 1A).

Candidate Models
Seven candidate models were considered for inclusion in the ensemble, each of which is an IU-specific probabilistic model (Figure 1B).That is, each candidate model predicts a probability that TF has a range of specific values for each district for each year.Forecasts were provided as probability distribution functions (PDFs) for each year of 2019-2030.
We use 2 families of distribution to describe the observed frequency distributions.First, the truncated exponential distribution was used to model TF 1-9 observation as a function of the most likely value for TF 1-9 .This distribution was chosen as it is the maximum entropy distribution given the prevalence range of 0%-100%, and the mean of the distribution [22].Additionally, the exponential distribution is consistent with theoretical models of disease elimination [14,23,24] and observed trends [25,26].
Second, a uniform distribution over the range of 0%-100% TF 1-9 prevalence is used as a simple baseline model to assess the predictive accuracy of candidate models.Candidate models were included in the ensemble if they performed better than this uniform model.This, along with the ensemble weighting described below, provides a quantitative evaluation that models maintain predictive ability.

Spatial aggregation-based models:
There are many different spatial scales at which we could choose to aggregate the data to give a frequency distribution of TF prevalence-for example districts within a country, regions or the whole continent.For each geographic aggregation, a maximum likelihood fit is determined by assuming that the TF 1-9 values for each year follow an exponential distribution that is truncated to allow TF 1-9 values of 0% to 100%.The mean of the underlying (non-truncated) distribution is modeled to follow an exponential decay as a function of year.
(I) Country-level aggregation: For the country model, the unit is a country, except for countries with <50 surveys in the training data.All countries that did not meet this threshold were considered one cluster.(II) Region-level aggregation: Analogous to country-level aggregation, the unit here is within-country regions, as defined within the GET 2020 database.Regions with <30 surveys in the training data\were aggregated at the country level.Whenever aggregation on a country-level also failed to reach the threshold of 30 surveys, the associated surveys were included in one catch-all cluster.(III) Global model: For the global model, no geographic aggregation is performed.This generates an IU-agnostic forecast that has no spatial variation but varies over time.

Regression based models:
As before, the model for the PDF of the observed TF 1-9 values is a truncated exponential distribution.However, for these models, the mean of the distribution is obtained by training a linear regression model for the log of TF 1-9 prevalence.
(I) Fixed effects model: Previous TF 1-9 prevalence, regional average of TF 1-9 , Year (as proxy for baseline trend), and the number of rounds of MDA previously delivered were used to train a district-specific model for TF 1-9 on the logarithmic scale.Final variable selection for the candidate model was achieved via stepwise regression using Bayes Information Criterion (BIC).
(II) Mixed effects model: The same method as for the fixed-effects model was implemented with an additional random intercept term for each country.(III) Geospatial model: A geostatistical model [27] using a Matern spatial covariance function and year as the only other covariate was used to model the district-wise TF 1- 9 prevalence on the logarithmic scale.
Naive Exponential model: The average TF 1-9 prevalence of all surveys in the last year of the training data was used as the mean of an exponential fit.The resulting model is a forecasting model that is the same for each district and does not change with time.

Scoring
To evaluate probabilistic forecasts, we utilized 2 proper scores, the logarithmic score (LogS) and continuous ranked probability score (CRPS) [28].Proper scores incentivize honest forecasting-that is, the best score is obtained by reporting the true distribution [29].When scoring a data set for a specific model, the individual scores for each TF 1-9 prevalence in the dataset are summed.Lower scores represent better forecasts.
Sensitivity analysis was performed to look for variation in scores based on which score is used and for the years included in the scoring dataset.

Ensemble Model
Ensembles were created as weighted averages of candidate models.The weight, w, that a candidate model, c i , contributes to the ensemble is related to the likelihood of the data for each candidate model, given by the exponential of the overall LogS score.The weights were tempered by a constant, k, to limit the over-contribution of any one model to the ensemble.

Model Outputs
The forecast of each candidate model and the resulting ensembles were represented as a probability distribution for each IU for each year.This format captures the prediction and the uncertainty of the forecast.Different summary metrics can then be constructed for each district, such as the probability that a district will reach the TF 1-9 < 5% target for EPHP by 2030.All analyses were performed using R Statistical Software [30].

Data
Trachoma survey data and MDA distribution data was retrieved from the GET2020 database as of February 2023, containing data from 1985 to 2023.After excluding any data points that were incomplete for TF 1-9 prevalence, survey year or IU, and excluding data prior to 2004 when TF 1-9 diagnosis was less uniform, 5898 surveys in 2597 IUs across 25 countries were included in the study (Table 1).Of these surveys, 74% were included in the training data set (2004-2018), 19% in the scoring data set (2019-2021), and 7% in the evaluation data set (2022-2023).
The distribution of TF 1-9 prevalence over the years shows an increasing density toward the left, that is, an increasing probability of TF 1-9 < 5% over time (Supplementary Figure 1).

Candidate Model Evaluation
All 7 candidate models described in the methods section were eligible to be included in the ensembles because they all outperformed the uniform model.Excluded models that performed worse than the uniform model included those with a Gaussian-shaped probability density function.The poorly performing models presumably did not represent uncertainty in TF 1-9 measurement adequately.The ranks based on both scoring rules were fairly consistent, with regression-based models outperforming the aggregation models.The fixed effects model was the best performing (lowest scoring) model based on both scoring rules.Additionally, while the models performed consistently on all years in the scoring data set based on LogS, the models consistently performed worse in 2019 based on CRPS (result not shown).

Ensembles Models and Their Evaluation
When determining the mix of candidate models used in our ensemble models, tempering with low values of k effectively   excluded all but the fixed effects model.Increasing the value of k (k = 1000) made the weights more uniform.This decreased the maximum contribution of a single model from 100% to 20% (Figure 2A).Based on both scoring rules, the ensemble with k = 100 was the best performing model on the evaluation dataset.Consistent with the use of ensemble models in other applications, the ensemble model performed better than its constituent models.Scores for the candidate models on the evaluation data set were consistent with that on the scoring data set, with the regression-based models outperforming the clustering-based models (Figure 2B).

Global EPHP 2030
Based on the forecast from the ensemble with k = 100, it is expected that with current treatment, 1.46% [1.43%-1.48%](172/ 11 760) IUs in previously endemic countries will not attain elimination as a public health problem by 2030 (Figure 3).Thus, global EPHP by 2030 will likely require enhanced intervention and/or surveillance in high-risk IUs.

DISCUSSION
According to our forecast, approximately 172 IUs will not attain EPHP by 2030 under current trends.Enhanced interventions, such as biannual MDA, or novel interventions, such as a vaccine, could be particularly useful in the IUs not expected to reach control under current trends.Such interventions could be modeled to assess their adequacy to help these IUs meet the target.Additionally, using map information and data from other IUs in the region, we can predict TF 1-9 in areas where no surveys have taken place.This insight could be leveraged to plan targeted future surveys in those regions.
Because it informs optimal distribution of treatment and other interventions, the value of district-specific forecasting in modeling the prevalence of TF 1-9 may be most pronounced as we approach global EPHP.Meanwhile, forecasting is generally more accurate in the short term and enhanced interventions could improve outcomes [31,32].This warrants mid-term goals that can be used to measure progress and revisiting the 2030 forecasts periodically with updated data.
A key strength of our modeling approach is its dynamic nature-each model can be re-trained as more survey data becomes available, making updated forecasts responsive to global and local changes in trachoma prevalence and trends.In addition to being able to include new data, additional models can also be added to the ensemble to improve the accuracy and certainty of the forecast.However, due to the statistical nature of the forecasting method, mechanistic conclusions such as intervention efficacy or the secular trend cannot be drawn directly from the ensemble model.
The ensemble outperformed its constituent models on the evaluation dataset, based on both scoring rules used.This effect was more pronounced on CRPS, indicating that the ensemble reduces the forecasting error.Additionally, tempering the weights that each candidate model contributes to the ensemble improved the performance as compared to a non-tempered (k = 1) model.This indicates that the best performing candidate model may vary each year, potentially attributable to a non-random sample of IUs being surveyed each year.
IU-specific models performed better than the IU-agnostic models (global and naïve exponential models) on both the scoring and evaluation datasets, indicating IU-wise heterogeneity.Additionally, the geospatial model was the most consistently performing model across years in the scoring dataset, suggesting that smoothing over neighboring IUs adjusts for some of the stochasticity of TF 1-9 .
In the present study we chose TF 1-9 as the prevalence marker for forecasting due to its integral role in WHO's definition of elimination of trachoma as a public health problem.However, TF 1-9 has a low sensitivity for active infection because the postinfectious inflammation can take months to resolve [33].Additionally, TF 1-9 factors such as inter-rater variability make it a noisy measure.This adds to the complexity of making public health decisions based on TF 1-9 alone and motivates an assessment of bias in TF 1-9 measurements.The incorporation of other prevalence measures, such as Trachomatous Trichiasis and Anti-Pgp3 serology might alleviate some challenges in forecasting IU-specific TF 1-9 and the relationship to ongoing transmission.
The models included a host of predictors, both explicitly and implicitly.Factors such as spatial variability [34], MDA interventions [35,36], and secular (ie, not dependent on MDA) decline of trachoma prevalence [37] are included explicitly as covariates in various candidate models.On the other hand, the specific impact of interventions such as facial cleanliness, and environmental improvement (F&E) [38] are included by proxy as these interventions are primarily implemented together with MDA.Additionally, F&E are believed to be a contributor to the secular trend.
Each candidate model has its own limitations-such as the global model not accounting for geospatial trends, the nonindependence of observations for regression-based models, and the ad-hoc clustering of IUs in the aggregation models.However, the creation of ensemble models draws on the strengths of each of the candidate models by weighting them based on the accuracy of their predictions.
Although WHO's 2030 goal of global EPHP may seem arbitrary, current results suggest that it is achievable in most regions.Enhanced intervention may be required to meet the goal in certain regions, and district-wise forecasts can be used to target IUs that might benefit the most from such interventions.

Figure 1 .
Figure 1.A, A schematic of the methodology used to derive IU-wise probabilistic forecasts of trachoma prevalence.The method evaluates the accuracy of a suite of candidate models and then combines the highest performing ones to form an ensemble forecast.B, Sample forecast for a sample district for 2030 is represented as the probability distribution function defined over the range of 0%-100% TF 1-9 .The candidate models (light gray lines), that perform better than the uniform model (dotted gray line), are included in the ensemble model (black line).By forecasting the distribution of observed TF values, various metrics can be calculated, such as the probability of achieving EPHP by 2030.Abbreviations: EPHP, elimination of trachoma as a public health problem; IU, implementation unit; TF, trachomatous inflammation-follicular.

Figure 2 .
Figure 2. (Top panel): Weight of each candidate model in the ensembles by tempering constant, k.Higher weights indicate more influence in the ensemble model.Higher levels of tempering reduce the influence of any one model by reducing the scale of the scores that candidate models can achieve.(Bottom panel): Cumulative log score of the ensembles and 7 candidate models based on the evaluation data set (2022-2023).Lower scores represent better agreement with data.The numbers indicate the overall rank of each model.

Figure 3
Figure 3 Map of Africa depicting the probability that TF will be <5% in 2030, which would satisfy the WHO requirement for elimination of trachoma as a public health problem.Results are based on the ensemble model with k = 100.Darker colors depict a lower probability of TF being controlled by the target date.Yellow areas of the map depict areas with a high probability of control.Abbreviations: TF, trachomatous inflammation-follicular; WHO, World Health Organization.